Discovery of Convoys in Trajectory Databases 



Hoyoung Jeung†  Man Lung Yiu*  Xiaofang Zhou†  Christian S. Jensen*  Heng Tao Shen†

†The University of Queensland, National ICT Australia (NICTA), Brisbane
*Department of Computer Science, Aalborg University, Denmark

{hoyoung, zxf, shenht}@itee.uq.edu.au  {mly, csj}@cs.aau.dk



ABSTRACT 

As mobile devices with positioning capabilities continue to pro- 
liferate, data management for so-called trajectory databases that 
capture the historical movements of populations of moving ob- 
jects becomes important. This paper considers the querying of such 
databases for convoys, a convoy being a group of objects that have 
traveled together for some time. 

More specifically, this paper formalizes the concept of a convoy 
query using density-based notions, in order to capture groups of 
arbitrary extents and shapes. Convoy discovery is relevant for real- 
life applications in throughput planning of trucks and carpooling 
of vehicles. Although trajectories have been studied extensively
in the literature, none of the existing techniques can be applied to
retrieve exact convoy result sets. Motivated by this, we develop
three efficient algorithms for convoy discovery that adopt the well- 
known filter-refinement framework. In the filter step, we apply line- 
simplification techniques on the trajectories and establish distance 
bounds between the simplified trajectories. This permits efficient 
convoy discovery over the simplified trajectories without missing 
any actual convoys. In the refinement step, the candidate convoys 
are further processed to obtain the actual convoys. Our comprehen- 
sive empirical study offers insight into the properties of the paper's 
proposals and demonstrates that the proposals are effective and ef- 
ficient on real-world trajectory data. 



1. INTRODUCTION 

Although the mobile Internet is still in its infancy, very large 
volumes of position data from moving objects are already being 
accumulated. For example, Inrix, Inc., based in Kirkland, WA, receives
real-time GPS probe data from more than 650,000 commercial fleet
vehicles, delivery vehicles, and taxis [1]. As the mobile Internet
continues to proliferate and as congestion becomes increasingly 
widespread across the globe, the volumes of position data being 
accumulated are likely to soar. Such data may be used for many 
purposes, including travel-time prediction, re-routing, and the iden- 
tification of ride-sharing opportunities. This paper addresses one 
particular challenge to do with the extraction of meaningful and 
useful information from such position data in an efficient manner. 

The movement of an object is given by a continuous curve in the
(space, time) domain, termed a trajectory. The past trajectory of
an object is typically approximated based on a collection of
time-stamped positions, e.g., obtained from a GPS device. As an
example, Figure 1(a) depicts the trajectories of four objects $o_1$, $o_2$,
$o_3$, and $o_4$ in $(x, y, t)$ space.

Given a collection of trajectories, it is of interest to discover 
groups of objects that travel together for more than some minimum 
duration of time. A number of applications may be envisioned. The 
identification of delivery trucks with coherent trajectory patterns 
may be used for throughput planning. The discovery of common 
routes among commuters may be used for the scheduling of collec- 
tive transport. The identification of cars that follow the same routes 
at the same time may be used for the organization of carpooling, 
which may reduce congestion, pollution, and CO2 emissions. 

Figure 1: Lossy-flock Problem

The discovery of so-called flocks [5, 13, 14] has received some
attention. A flock is a group of objects that move together within a
disc of some user-specified size. On the one hand, the chosen disc
size has a substantial effect on the results of the discovery process.
On the other hand, selecting a proper disc size turns out to be
difficult: objects that intuitively belong together may not fit within
any disc of the given size, while objects that do not belong together
may fall within such a disc. For some data sets, no single disc size
may exist that works well for all parts of the (space, time) domain.
In Figure 1(a), all objects travel together in a natural group. However,
as shown in Figure 1(b), object $o_4$ does not enter the disc and is not
discovered as a member of the flock. A key reason why this lossy-flock
problem occurs is that what constitutes a flock is very sensitive to the
user-specified disc size, which is independent of the data distribution.
In addition, the use of a circular shape may not always be appropriate.
For example, suppose that two different groups of cars move across a
river and that each group has a long, linear form along a road. A disc
large enough to capture one group may also capture the other group,
reporting both as one flock. Ideally, no particular shape should be
fixed a priori.

To avoid rigid restrictions on the sizes and shapes of the
trajectory patterns to be discovered, we propose the concept of a
convoy, which is able to capture generic trajectory patterns of any
shape and any extent. This concept employs the notion of density
connection [12], which enables the formulation of groups of arbitrary
shape. Given a set of trajectories $O$, an integer $m$, a distance value
$e$, and a lifetime $k$, a convoy query retrieves all groups of objects,
i.e., convoys, each of which has at least $m$ objects so that these
objects are so-called density-connected with respect to distance $e$
during $k$ consecutive time points. Intuitively, two objects in a group
are density-connected if a sequence of objects exists that connects the
two objects and the distance between consecutive objects does not
exceed $e$. (The formal definition is given in Section 3.) Each group
of objects in the result of a convoy query is associated with the time
intervals during which the objects in the group traveled together.

The efficient discovery of convoys in a large trajectory database 
is a challenging problem. Convoy queries compute sets of objects 
and are more expensive to process than spatio-temporal joins [7],
which compute pairs of objects. Past studies on the retrieval of
similar trajectories generally use distance functions that consider
the distances between pairs of trajectories across all of time
[10, 15, 25]. In contrast, we consider distances during relatively short
durations of time. Other relevant work concerns the clustering of
moving objects [17, 19, 21]. In these works, a moving cluster exists
if a shared set of objects exists across adjacent time points, but objects
may join and leave a cluster during the cluster's lifetime. Hence, 
moving clusters carry different semantics and do not necessarily 
qualify as convoys. 

Jeung et al. first proposed the convoy query and outlined
preliminary techniques for convoy discovery [4]. In this paper, we
extend that work by developing more advanced algorithms and by
analyzing each discovery method in real-world settings. Specifically,
we introduce four effective and efficient algorithms for answering
the convoy query. The first method adapts the solution for moving
cluster discovery to our convoy problem. The second method,
called CuTS (Convoy Discovery using Trajectory Simplification),
employs the filter-refinement framework: a set of candidate convoys
is retrieved in the filter step, and the candidates are then further
processed in the refinement step to produce the actual convoys. In
the filter step, we apply line simplification techniques [11] on the
trajectories to reduce their sizes; hence, it becomes very efficient 
to search for convoys over simplified trajectories. We establish dis- 
tance bounds between simplified trajectories, in order to ensure that 
no actual convoy is missing from the candidate convoy set. The 
third method (CuTS+) accelerates the process of trajectory simpli- 
fication of CuTS to increase the efficiency of the filter step even 
further. The last method, named CuTS*, is an advanced version of 
CuTS that enhances the effectiveness of the filter step by introduc- 
ing tighter distance bounds for simplified trajectories. 

The main novelties of this paper are summarized as follows: 

• Our filter step operates on trajectories processed by line sim- 
plification techniques; this is different from most related works 
that employ spatial approximation (e.g., bounding boxes) in 
the filter step. The rationale is that conventional methods us- 
ing bounding boxes introduce substantial empty space, ren- 
dering them undesirable for the processing of trajectory data. 

• To guarantee correct convoy discovery, we establish distance 
bounds for range search over simplified trajectories. In contrast,
the distance bounds studied elsewhere [8] are applicable
only to specific query types, not to the convoy problem.

• We study various trajectory simplification techniques in con- 
junction with different query processing mechanisms. In ad- 
dition, we show how to tighten the distance bounds. 

• We present comprehensive experimental results using several 
real trajectory data sets, and we explain the advantages and 
disadvantages of each proposed method. 



The remainder of this paper is organized as follows: In Section 2,
we discuss previous methods related to the convoy query. We
formulate the focal problem of this paper in Section 3. A modified
moving cluster method for convoy discovery is presented in
Section 4. We propose more efficient methods based on trajectory
simplification in Sections 5 and 6. Section 7 reports the results of
experimental performance comparisons, followed by conclusions
in Section 8.

2. RELATED WORK 

We first review existing work on trajectory clustering and then
cover trajectory simplification, which is an important aspect of our
techniques for convoy discovery. We end by considering spatio- 
temporal joins and distance measures for trajectories. 

2.1 Clustering over Trajectories 

Given a set of points, the goal of spatial clustering is to form 
clusters (i.e., groups) such that (i) points within the same cluster 
are close to each other, and (ii) points from different clusters are 
far apart. In the context of trajectories, the locations of trajectories 
can be clustered at chosen time points. Consider the trajectories in 
Figure 2(a). We first obtain a cluster $c_1$ at time $t = 1$, then a cluster
$c_2$ at $t = 2$, and eventually a cluster $c_3$ at $t = 3$.

Kalnis et al. propose the notion of a moving cluster [19], which
is a sequence of spatial clusters appearing during consecutive time
points, such that the portion of common objects in any two
consecutive clusters is not below a given threshold parameter $\theta$, i.e.,
$|c_t \cap c_{t+1}| / |c_t \cup c_{t+1}| \geq \theta$, where $c_t$ denotes a cluster at time $t$. There is a
significant difference between a convoy and a moving cluster. For 
instance, in Figure 2(a), $o_2$, $o_3$, and $o_4$ form a convoy with 3
objects during 3 consecutive time points. On the other hand, if we
set $\theta = 1$ (i.e., require 100% overlapping clusters), the overlap
between $c_1$ and $c_2$ falls below this threshold, and the above objects
will not be discovered as a moving cluster. Next, in Figure 2(b), with
a sufficiently low setting of $\theta$, $c_1$, $c_2$, and $c_3$ become a moving
cluster. However, this is not a convoy.
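
The difference between the two notions can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions: the object identifiers and the three snapshot clusters are hypothetical (they are not read off Figure 2), and the helper names `overlap`, `is_moving_cluster`, and `common_objects` are ours.

```python
def overlap(c_t, c_next):
    """Fraction |c_t ∩ c_next| / |c_t ∪ c_next| of objects shared by two clusters."""
    c_t, c_next = set(c_t), set(c_next)
    union = c_t | c_next
    return len(c_t & c_next) / len(union) if union else 0.0


def is_moving_cluster(snapshots, theta):
    """True if every pair of consecutive snapshot clusters overlaps by at least theta."""
    return all(overlap(a, b) >= theta for a, b in zip(snapshots, snapshots[1:]))


def common_objects(snapshots):
    """Objects present in every snapshot cluster (what a convoy requires)."""
    return set.intersection(*(set(c) for c in snapshots))


# Hypothetical snapshot clusters at three consecutive time points.
snapshots = [{"o1", "o2", "o3", "o4"}, {"o2", "o3", "o4"}, {"o2", "o3", "o4"}]
print(is_moving_cluster(snapshots, theta=1.0))  # strict threshold rejects the sequence
print(sorted(common_objects(snapshots)))        # yet three objects stay together
```

With a strict threshold the sequence is rejected although three objects travel together throughout, while a loose threshold can accept sequences whose common object set is too small to form a convoy.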

Figure 2: Convoys Versus Moving Clusters

Spiliopoulou et al. [24] study transitions in moving clusters (e.g.,
disappearance and splitting) between consecutive time points. As
transitions are based on the consideration of common objects at
consecutive time points, their techniques do not support convoy
discovery either. Next, Li et al. [21] study the notion of a moving
micro-cluster, which is a group of objects that are not only close
to one another at the current time, but are also expected to move
together in the near future. Recently, Lee et al. [20] have proposed
to partition trajectories into line segments and build groups of close
segments. This proposal does not consider the temporal aspects of
the trajectories. As a result, some objects can belong to the same
group even though they have never traveled close together (at the
same time). Most recently, Jensen et al. [17] have proposed
techniques for maintaining clusters of moving objects. They consider
the clustering of the current and near-future positions, while we
consider past trajectories.



As mentioned earlier, several slightly different notions of a flock
[13, 14] relate to that of a convoy. The notion most relevant to our
study defines a flock as a group of at least $m$ objects staying
together within a circular region of radius $e$ during a specified time
interval [5, 13]. Al-Naymat et al. [3] apply random projection to
reduce the dimensionality of the data and thus obtain better
performance. Gudmundsson et al. [13] propose approximation
techniques and exploit an index to accelerate the computation of flocks.
It is also shown that the discovery of the longest-duration flock is
an NP-hard problem. Note that these studies exhibit
the lossy-flock problem identified in Section 1.

2.2 Trajectory Simplification 

A trajectory is often represented as a polyline, which is a se- 
quence of connected line segments. Line simplification techniques 
have been proposed to simplify polylines according to some
user-specified resolution [11, 16].

The Douglas-Peucker algorithm (DP) [11] is a well-known and
efficient line simplification technique. Given
a polyline specified by a sequence of $T$ points $\langle p_1, p_2, \ldots, p_T \rangle$
and a distance threshold $\delta$, the goal is to derive a new polyline
with fewer points while deviating from the original polyline by
at most $\delta$. The DP algorithm initially constructs the line segment
$\overline{p_1 p_T}$. It then identifies the point $p_i$ farthest from the line. If this
point's (perpendicular) distance to the line is within $\delta$, then DP
returns $\overline{p_1 p_T}$ and terminates. Otherwise, the polyline is decomposed at $p_i$,
and DP is applied recursively to the sub-polylines $\langle p_1, p_2, \ldots, p_i \rangle$
and $\langle p_i, \ldots, p_T \rangle$. As the worst-case time complexity of this
algorithm is $O(T^2)$, Hershberger et al. [16] propose a faster version of
this method with time complexity $O(T \log T)$. However, that version
assumes that an object's trajectory cannot intersect itself, which is
not a valid assumption for the data we consider.
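
The recursion described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: points are assumed to be plain `(x, y)` tuples, the function names are ours, and the distance is measured to the segment between the end points (which coincides with the perpendicular distance whenever the projection falls inside the segment).

```python
import math


def point_segment_distance(p, a, b):
    """Shortest Euclidean distance from point p to the segment from a to b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:  # degenerate segment: both end points coincide
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # Position of the projection of p on the segment, clamped to [0, 1].
    u = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    u = max(0.0, min(1.0, u))
    return math.hypot(p[0] - (a[0] + u * dx), p[1] - (a[1] + u * dy))


def douglas_peucker(points, delta):
    """Simplify a polyline so that no dropped point deviates by more than delta."""
    if len(points) <= 2:
        return list(points)
    first, last = points[0], points[-1]
    split, worst = 0, -1.0
    for i in range(1, len(points) - 1):
        dist = point_segment_distance(points[i], first, last)
        if dist > worst:
            split, worst = i, dist
    if worst <= delta:
        return [first, last]  # every intermediate point is within the tolerance
    left = douglas_peucker(points[: split + 1], delta)
    right = douglas_peucker(points[split:], delta)
    return left[:-1] + right  # the split point appears once


polyline = [(0, 0), (1, 0.1), (2, 0), (3, 2), (4, 0)]
print(douglas_peucker(polyline, 0.5))
```

In the worst case each recursion level scans all remaining points and splits off only one of them, which gives the quadratic bound mentioned above.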

The DP technique deals with line simplification only in the
spatial domain, ignoring the time domain of the trajectories. Consider
the example in Figure 3(a). Since the distance from $p_2$ to $\overline{p_1 p_3}$
is within $\delta$, the DP algorithm omits $p_2$ and simply returns $\overline{p_1 p_3}$.
Similarly, $q_2$ is also omitted and the polyline is simplified to $\overline{q_1 q_3}$.

In contrast, Meratnia et al. [23] take into account the temporal
aspects in line simplification. Figure 3(b) exemplifies the working
procedure of their algorithm (say, DP*). First, DP* derives
the point $p'_2$ on the line $\overline{p_1 p_3}$ by calculating the ratio of $p_2$'s time
between $t = 1$ of $p_1$ and $t = 3$ of $p_3$. Then, it measures the distance
$D(p_2, p'_2)$ between $p_2$ and $p'_2$, instead of the perpendicular distance
from $p_2$ to $\overline{p_1 p_3}$. Since $D(p_2, p'_2) > \delta$, $p_2$ is still kept after the
simplification, while it was removed by DP in Figure 3(a).
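
The time-ratio distance used by DP* can be sketched as follows. The code is an illustration only: points are assumed to be `(x, y, t)` tuples with distinct end-point times, the coordinates in the example are hypothetical rather than taken from Figure 3, and the function names are ours.

```python
import math


def time_ratio_point(p_start, p_end, t):
    """Location on the segment p_start-p_end at time t, assuming constant speed."""
    x1, y1, t1 = p_start
    x2, y2, t2 = p_end
    ratio = (t - t1) / (t2 - t1)  # assumes t2 != t1
    return (x1 + ratio * (x2 - x1), y1 + ratio * (y2 - y1))


def time_ratio_distance(p, p_start, p_end):
    """Distance between p and its time-synchronized counterpart on the segment."""
    x, y, t = p
    qx, qy = time_ratio_point(p_start, p_end, t)
    return math.hypot(x - qx, y - qy)


# Hypothetical points: p2 lies on the line p1-p3 but is reached "too early".
p1, p2, p3 = (0.0, 0.0, 1), (3.0, 0.0, 2), (4.0, 0.0, 3)
print(time_ratio_point(p1, p3, 2))      # expected location at t = 2
print(time_ratio_distance(p2, p1, p3))  # deviation seen by DP*
```

In this example the perpendicular distance of `p2` to the line is zero, so DP would drop the point for any tolerance, whereas the time-ratio distance is positive and DP* keeps the point whenever that distance exceeds the tolerance.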




Figure 3: Comparison of Different Trajectory Simplifications 



2.3 Distance Measures and Joins 

A basic way of measuring the distance between two trajectories 
used in the literature is to compute the sum of their Euclidean dis- 
tances over time points. Such a distance measure may not be able 
to capture the inherent distance between trajectories because it does 
not take into account particular features of trajectories (e.g., noise, 
time distortion). Thus, it is important to devise a distance function 
that "understands" the characteristics of trajectories. 

A well-known approach is Dynamic Time Warping (DTW) [25],
which applies dynamic programming for aligning two trajectories
in such a way that their overall distance is minimized. More recent
proposals for trajectory distance functions include Longest Common
Subsequence (LCSS) [15], Edit Distance on Real Sequence
(EDR) [10], and Edit distance with Real Penalty (ERP) [9]. Lee
et al. [20] point out that the above distance measures capture the
global similarity between two trajectories, but not their local simi- 
larity during a short time interval. Thus, these measures cannot be 
applied in a simple manner for convoy discovery. 

Given two data sets $P_1$ and $P_2$, spatio-temporal joins find pairs
of elements from the two sets that satisfy predicates with both
spatial and temporal attributes [18]. The close-pair join reports all
object pairs $(o_1, o_2)$ from $P_1 \times P_2$ whose distance $D_\tau(o_1, o_2)$
within a time interval $\tau$ is bounded by a user-specified distance $e$,
i.e., $D_\tau(o_1, o_2) \leq e$. Plane-sweep techniques [6, 26] have been
proposed for evaluating spatio-temporal joins. Like the close-pair join,
the trajectory join [7] aims at retrieving all pairs of similar
trajectories between two data sets. Bakalov et al. [7] represent
trajectories as sequences of symbols and apply sliding window
techniques to measure the symbolic distance between possible pairs.
These studies consider pairs of objects, whereas we consider sets of
objects.

3. PROBLEM DEFINITION 

This section formalizes the convoy problem. We start with the 
definitions of distances for points, line segments, and bounding 
boxes : 

DEFINITION 1. (Distance Functions)

• Given two points $p_u$ and $p_v$, $D(p_u, p_v)$ is defined as the
Euclidean distance between $p_u$ and $p_v$.

• Given a point $p$ and a line segment $l$, $D_{PL}(p, l)$ is defined
as the shortest (Euclidean) distance between $p$ and any point
on $l$.

• Given two line segments $l_u$ and $l_v$, $D_{LL}(l_u, l_v)$ is defined as
the shortest (Euclidean) distance between any two points on
$l_u$ and $l_v$, respectively.

• With $B_u$ and $B_v$ being boxes, $D_{min}(B_u, B_v)$ is defined
as the minimum distance between any pair of points belonging
to each of the two boxes.
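
The four distance functions of Definition 1 can be sketched in Python for the two-dimensional case. The representation is an assumption made for illustration: points are `(x, y)` tuples, a segment is a pair of points, and a box is a tuple `(x_min, y_min, x_max, y_max)`; the function names mirror the notation but are otherwise ours.

```python
import math


def d(p, q):
    """D(p_u, p_v): Euclidean distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def d_pl(p, seg):
    """D_PL(p, l): shortest distance between point p and any point on segment seg."""
    a, b = seg
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return d(p, a)
    u = max(0.0, min(1.0, ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq))
    return d(p, (a[0] + u * dx, a[1] + u * dy))


def _orient(a, b, c):
    """Sign of the turn a -> b -> c (cross product of ab and ac)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _properly_cross(s1, s2):
    """True if the two segments cross at a point interior to both."""
    (a, b), (c, e) = s1, s2
    return _orient(a, b, c) * _orient(a, b, e) < 0 and _orient(c, e, a) * _orient(c, e, b) < 0


def d_ll(s1, s2):
    """D_LL(l_u, l_v): shortest distance between any two points of the segments."""
    if _properly_cross(s1, s2):
        return 0.0
    # Otherwise the minimum is attained at an end point of one of the segments.
    return min(d_pl(s1[0], s2), d_pl(s1[1], s2), d_pl(s2[0], s1), d_pl(s2[1], s1))


def d_min(b1, b2):
    """D_min(B_u, B_v): minimum distance between two axis-aligned boxes."""
    gap_x = max(b1[0] - b2[2], b2[0] - b1[2], 0.0)
    gap_y = max(b1[1] - b2[3], b2[1] - b1[3], 0.0)
    return math.hypot(gap_x, gap_y)
```

Touching and collinear overlapping segments need no special case, because one end point then lies on the other segment and the end-point term already evaluates to zero.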

The boxes introduced in the definition will be used for the
bounding of line segments. Next, the time domain is defined as the
ordered set $\{t_1, t_2, \ldots, t_T\}$, where $t_j$ is a time point and $T$ is the
total number of time points.

In our problem setting, we consider a practical trajectory database 
model. We assume each trajectory may have a different length from 
others and may also appear or disappear at any time in the time domain.
In addition, each location of a trajectory can be sampled either regularly
(e.g., every second) or irregularly (i.e., some time points of the time
domain may be missing between two consecutive time points of the
trajectory).

The trajectory of an object $o$ is represented by a polyline that is
given as a sequence of timestamped locations $o = \langle p_a, p_{a+1}, \ldots, p_b \rangle$,
where $p_j = (x_j, y_j, t_j)$ indicates the location of $o$ at time $t_j$, with
$t_a$ being the start time and $t_b$ being the end time. The time interval
of $o$ is $o.\tau = [t_a, t_b]$. As a shorthand, we use $o(t_j)$ to
refer to the location of $o$ at time $t_j$ (i.e., location $p_j$).

Figure 4 illustrates the polylines representing the trajectories of
three objects $o_1$, $o_2$, and $o_3$ during the time interval from $t_1$ to $t_4$.




Figure 4: An Example of a Convoy 

As a precursor to defining the convoy query, we need to
understand the notion of density connection [12]. Given a distance
threshold $e$ and a set of points $S$, the $e$-neighborhood of a point $p$
is given as $NH_e(p) = \{q \in S \mid D(p, q) \leq e\}$. Then, given a
distance threshold $e$ and an integer $m$, a point $p$ is directly
density-reachable from a point $q$ if $p \in NH_e(q)$ and $|NH_e(q)| \geq m$. A
point $p$ is said to be density-reachable from a point $q$ with respect
to $e$ and $m$ if there exists a chain of points $p_1, p_2, \ldots, p_n$ in set $S$
such that $p_1 = q$, $p_n = p$, and $p_{i+1}$ is directly density-reachable
from $p_i$.

DEFINITION 2. (Density-Connected) Given a set of points $S$,
a point $p \in S$ is density-connected to a point $q \in S$ with respect
to $e$ and $m$ if there exists a point $x \in S$ such that both $p$ and $q$ are
density-reachable from $x$.
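
These notions can be illustrated with a small Python sketch that groups points into density-connected sets in the style of DBSCAN. It is a simplified illustration under our own assumptions: points are given as a dictionary from (sortable) identifiers to `(x, y)` tuples, the neighborhood is computed by brute force, a border point reachable from several groups is assigned to the first one that reaches it, and the function names are ours.

```python
import math
from collections import deque


def neighborhood(points, p, e):
    """NH_e(p): identifiers of all points within distance e of p (including p)."""
    return [q for q in points if math.dist(points[p], points[q]) <= e]


def density_connected_groups(points, e, m):
    """Partition points into density-connected groups; noise points are omitted."""
    neighbors = {p: neighborhood(points, p, e) for p in points}
    core = {p for p in points if len(neighbors[p]) >= m}
    seen, groups = set(), []
    for seed in sorted(core):
        if seed in seen:
            continue
        seen.add(seed)
        group, queue = set(), deque([seed])
        while queue:
            p = queue.popleft()
            group.add(p)
            if p not in core:
                continue  # border points join a group but do not extend it
            for q in neighbors[p]:
                if q not in seen:
                    seen.add(q)
                    queue.append(q)
        groups.append(group)
    return groups


# "a" and "c" are farther apart than e, yet connected through "b"; "d" is noise.
points = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (10, 0)}
print(density_connected_groups(points, e=1.0, m=2))
```

The example shows why the resulting groups can take arbitrary shapes: membership depends on chains of nearby points rather than on fitting inside a disc of fixed size.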

The definition of density-connection permits us to capture a group
of "connected" points with arbitrary shape and extent, and thus to
overcome the lossy-flock problem shown in Figure 1. By
considering density-connected objects for consecutive time points, we
define the convoy query as follows:

DEFINITION 3. (Convoy Query) Given a set of trajectories of 
N objects, a distance threshold e, an integer m, and an integer 
lifetime k, the convoy query returns all possible groups of objects, 
so that each group consists of a (maximal) set of density-connected 
objects with respect to e and m during at least k consecutive time 
points. 

Consider the convoy query with the parameters $m = 2$ and $k = 3$
issued over the trajectories in Figure 4. The result is
$\langle o_2, o_3, [t_1, t_3] \rangle$, meaning that $o_2$ and $o_3$ belong to the same
convoy during consecutive time points from $t_1$ to $t_3$.

Table 1 summarizes the notation introduced in this section and
used throughout the paper.

4. COHERENT MOVING CLUSTER (CMC) 

A simple technique for computing a convoy is to first perform
(density-connected) clustering on the objects at each time point and then
to extract their common objects in an attempt to form convoys. This
approach is similar to the methods for discovering moving clusters
[19]. However, those methods are unable to discover the exact convoy
results, as explained next:



Symbol                 Meaning
$p$                    Point/location (in the spatial domain)
$t$                    Time point
$o_i$                  Original trajectory of an object
$o_i(t)$               Location of $o_i$ at time $t$
$o'_i$                 Simplified trajectory (of $o_i$)
$l'_i$                 Line segment of $o'_i$
$o_i.\tau$             Time interval of $o_i$
$l'_i.\tau$            Time interval of $l'_i$
$D(p_u, p_v)$          Euclidean distance between points
$D_{PL}(p, l)$         The shortest distance from point to line segment
$D_{LL}(l_u, l_v)$     The shortest distance between line segments
$B(l)$                 The minimum bounding box of $l$
$D_{min}(B_u, B_v)$    The minimum distance between two boxes

Table 1: Summary of Notation

• Let $c_t$ and $c_{t+1}$ be (snapshot) clusters at times $t$ and $t + 1$.
These clusters belong to the same moving cluster if they
share at least a fraction $\theta$ of their objects
($|c_t \cap c_{t+1}| / |c_t \cup c_{t+1}| \geq \theta$), where $\theta$ is a
user-specified threshold value between 0 and 1. The problem with
applying moving cluster methods to convoy discovery is that no
absolute $\theta$ value exists that can be used to compute the exact
convoy results: either false hits may be found, or actual convoys may
remain undiscovered, as explained in Section 2.1.

• A moving cluster can be formed as long as two snapshot clusters
have at least $\theta$ overlap, even for only two consecutive
time points. The lifetime ($k$) constraint does not apply to moving
clusters, but is essential for a convoy.

• As pointed out in the previous section, a trajectory may have
some missing time points due to irregular location sampling 
(e.g., 03 at t — 2 in Figure 0a)). In this case, we cannot 
measure the density-connection for all objects involved over 
those missing times. 

In order to solve the above problems for convoy discovery, we 
extend the moving cluster method into our Coherent Moving Clus- 
ter algorithm (CMC). First, we generate virtual locations for the 
missing time points. If any trajectory has a location at time $t_i$, but
another does not during its time interval, we apply linear interpolation
to create the virtual points at $t_i$. Second, to accommodate the
lifetime ($k$) constraint, we require each candidate convoy to have
(at least) $k$ clusters $c_t, c_{t+1}, \ldots, c_{t+k-1}$ during consecutive time
points. Third, we test the condition $|c_t \cap c_{t+1} \cap \cdots \cap c_{t+k-1}| \geq m$
to determine whether sufficiently many common objects are shared.
If all conditions are satisfied, the candidate is reported as an actual 
convoy. 
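
The first step, generating virtual locations by linear interpolation, can be sketched as follows. The sketch assumes a trajectory stored as a list of `(x, y, t)` tuples sorted by strictly increasing time; the function name `location_at` is ours.

```python
from bisect import bisect_left


def location_at(trajectory, t):
    """Return the (possibly virtual) location of a trajectory at time t.

    Returns None if t lies outside the time interval of the trajectory.
    """
    times = [p[2] for p in trajectory]
    if not times or t < times[0] or t > times[-1]:
        return None
    i = bisect_left(times, t)
    if times[i] == t:
        return trajectory[i][:2]  # an actual sample exists at time t
    # Otherwise interpolate linearly between the two surrounding samples.
    (x0, y0, t0), (x1, y1, t1) = trajectory[i - 1], trajectory[i]
    ratio = (t - t0) / (t1 - t0)
    return (x0 + ratio * (x1 - x0), y0 + ratio * (y1 - y0))


# A trajectory sampled at t = 1 and t = 3 only; the location at t = 2 is virtual.
trajectory = [(0.0, 0.0, 1), (4.0, 2.0, 3)]
print(location_at(trajectory, 2))
```

Objects whose time interval does not cover the requested time yield no location, so they simply do not take part in the clustering at that time point.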

We proceed to illustrate algorithm CMC using Figure 5, with the
parameters $m = 2$ and $k = 3$. Let $c_t^i$ be the $i$-th snapshot cluster
at time $t$. Clusters at time $t$ are obtained by applying a snapshot
density clustering algorithm (e.g., DBSCAN [12]) on the objects'
locations at time $t$.




(a) (b) (c) 

Figure 5: Query Processing of CMC, $m = 2$

Table 2 illustrates the execution steps of the algorithm. At time
$t_1$, we obtain a cluster $c_1^1$ (with objects $o_1$, $o_2$, and $o_3$) and consider
it a convoy candidate $v_1$. At time $t_2$, we retrieve a cluster $c_2^1$, which
is then compared with $v_1$. Since $c_2^1$ and $v_1$ have $m = 2$ common
objects, we compute their intersection and update candidate $v_1$. At
time $t_3$, we discover two clusters $c_3^1$ and $c_3^2$. Since $c_3^1$ shares no
objects with $v_1$, we consider $c_3^1$ as another convoy candidate $v_2$.
As $c_3^2$ shares $m = 2$ common objects with $v_1$, we update $v_1$ to
be its intersection with $c_3^2$. Eventually, $v_1$ is reported as a convoy
because it contains $m = 2$ common objects from clusters during
$k = 3$ consecutive time points.

Time    Clusters            Candidate set V
$t_1$   $c_1^1$             $v_1 = c_1^1$
$t_2$   $c_2^1$             $v_1 = c_1^1 \cap c_2^1$
$t_3$   $c_3^1, c_3^2$      $v_1 = c_1^1 \cap c_2^1 \cap c_3^2$, $v_2 = c_3^1$

Table 2: Execution Steps of CMC 

Algorithm 1 presents the pseudocode for the CMC algorithm.
The algorithm takes as inputs a set of object trajectories O and 
convoy query parameters m, k, and e. 

We use $V$ to represent the set of convoy candidates. We then
perform processing for each time point (in ascending order). The
set $V_{next}$ introduced in Line 3 is used to store candidates produced
at the current time $t$. Then, we consider only objects $o \in O$ whose
time intervals cover time $t$, i.e., $t \in o.\tau$. Their locations $o(t)$ are
inserted into the set $O_t$. If any object $o \in O_t$ has a missing location
at $t$, a virtual point is computed and then inserted.

Next, we apply DBSCAN on $O_t$ to obtain a set $C$ of clusters
(Line 7). The clusters in $C$ are compared to existing candidates
in $V$. If a cluster $c$ and a candidate $v$ share at least $m$ common
objects (Line 11), the current objects of $v$ are replaced by the common
objects between $c$ and $v$, and $v$ is then inserted into the set $V_{next}$
(Lines 13-15). At the same time, we extend the lifetime of the candidate
(Line 14). Each candidate whose lifetime is (at least) $k$ is reported
as a convoy (Lines 17-18).

Clusters (in $C$) having insufficient intersections with existing
candidates are inserted as new candidates into $V_{next}$ (Lines 19-
23). Then all candidates in $V_{next}$ are copied to $V$ so that they are
used for further processing in the next iteration.

Algorithm 1 CMC (Set of object trajectories $O$, Integer $m$, Integer $k$, Distance threshold $e$)

 1: $V \leftarrow \emptyset$
 2: for each time $t$ (in ascending order) do
 3:     $V_{next} \leftarrow \emptyset$
 4:     $O_t \leftarrow \{o(t) \mid o \in O \wedge t \in o.\tau\}$
 5:     if $O_t$.size $< m$ then
 6:         skip this iteration
 7:     $C \leftarrow$ DBSCAN($O_t$, $e$, $m$)
 8:     for each convoy candidate $v \in V$ do
 9:         $v$.assigned $\leftarrow$ false
10:         for each snapshot cluster $c \in C$ do
11:             if $|c \cap v| \geq m$ then
12:                 $v$.assigned $\leftarrow$ true
13:                 $v \leftarrow c \cap v$
14:                 $v$.endTime $\leftarrow t$
15:                 $V_{next} \leftarrow V_{next} \cup v$
16:                 $c$.assigned $\leftarrow$ true
17:         if $v$.assigned = false and $v$.lifetime $\geq k$ then
18:             $V_{result} \leftarrow V_{result} \cup v$
19:     for each $c \in C$ do
20:         if $c$.assigned = false then
21:             $c$.startTime $\leftarrow t$
22:             $c$.endTime $\leftarrow t$
23:             $V_{next} \leftarrow V_{next} \cup c$
24:     $V \leftarrow V_{next}$
25: return $V_{result}$
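
For readers who prefer executable code, the listing can be approximated by the Python sketch below. It is not the authors' implementation and deviates from the listing in ways that are our assumptions: the input is a list of `(t, {object id: (x, y)})` snapshots in which virtual locations have already been generated, a brute-force density clustering stands in for DBSCAN, the lifetime is counted in time points, a time point with fewer than $m$ objects closes all open candidates, and candidates still alive after the last time point are also reported.

```python
import math
from collections import deque


def snapshot_clusters(locations, e, m):
    """Brute-force density-connected clustering of one snapshot (stand-in for DBSCAN)."""
    nbrs = {p: [q for q in locations if math.dist(locations[p], locations[q]) <= e]
            for p in locations}
    core = {p for p in locations if len(nbrs[p]) >= m}
    seen, clusters = set(), []
    for seed in sorted(core):
        if seed in seen:
            continue
        seen.add(seed)
        cluster, queue = set(), deque([seed])
        while queue:
            p = queue.popleft()
            cluster.add(p)
            if p in core:
                for q in nbrs[p]:
                    if q not in seen:
                        seen.add(q)
                        queue.append(q)
        clusters.append(cluster)
    return clusters


def cmc(snapshots, m, k, e):
    """Return convoys as (objects, start time, end time) triples."""
    results, candidates = [], []  # candidate: (objects, start, end, lifetime)
    for t, locations in snapshots:
        clusters = snapshot_clusters(locations, e, m) if len(locations) >= m else []
        assigned = [False] * len(clusters)
        next_candidates = []
        for objects, start, end, lifetime in candidates:
            extended = False
            for i, cluster in enumerate(clusters):
                common = objects & cluster
                if len(common) >= m:  # candidate survives with the shared objects
                    extended, assigned[i] = True, True
                    next_candidates.append((common, start, t, lifetime + 1))
            if not extended and lifetime >= k:
                results.append((frozenset(objects), start, end))
        for i, cluster in enumerate(clusters):
            if not assigned[i]:  # unmatched clusters start new candidates
                next_candidates.append((set(cluster), t, t, 1))
        candidates = next_candidates
    # Assumption: also report candidates that are still open at the end.
    results += [(frozenset(o), s, end) for o, s, end, life in candidates if life >= k]
    return results


snapshots = [
    (1, {"a": (0, 0), "b": (1, 0), "c": (10, 10)}),
    (2, {"a": (1, 0), "b": (2, 0), "c": (20, 10)}),
    (3, {"a": (2, 0), "b": (3, 0), "c": (30, 10)}),
]
print(cmc(snapshots, m=2, k=3, e=1.5))
```

In the hypothetical input, objects `a` and `b` stay within distance 1 of each other for three time points while `c` drifts away, so a single convoy covering times 1 to 3 is reported.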



5. CONVOY DISCOVERY USING TRAJEC- 
TORY SIMPLIFICATION (CUTS) 

The CMC algorithm incurs high computational cost because it 
generates virtual locations for all missing time points and performs 
expensive clustering at every time point. In this section, we apply the
filter-and-refinement paradigm with the purpose of reducing the 
overall computational cost. For the filter step, we simplify the orig- 
inal trajectories and apply clustering on the simplified trajectories 
to obtain convoy candidates. The goal is to retrieve a superset of 
the actual convoys efficiently. In the refinement step, we consider 
each candidate convoy in turn. In particular, we perform cluster- 
ing on the original trajectories of the objects involved to determine 
whether the convoy indeed qualifies. The resulting CuTS algorithm 
is guaranteed to return correct convoy results. 

5.1 Simplifying Trajectories 

Given a trajectory represented as a polyline $o = \langle p_1, p_2, \ldots, p_T \rangle$
and a tolerance $\delta$, the goal of trajectory simplification is to derive
another polyline $o'$ such that $o'$ has fewer points and deviates from
$o$ by at most $\delta$. We say that $o'$ is a simplified trajectory of $o$ with
respect to $\delta$.

We apply the Douglas-Peucker algorithm (DP), as discussed in
Section 2.2, to simplify a trajectory. Initially, DP composes the
line $\overline{p_1 p_T}$ and finds the point $p_i \in o$ farthest from the line. If
the distance $D_{PL}(p_i, \overline{p_1 p_T}) \leq \delta$, segment $\overline{p_1 p_T}$ is reported as the
simplified trajectory $o'$. Otherwise, DP recursively processes the
sub-trajectories $\langle p_1, \ldots, p_i \rangle$ and $\langle p_i, \ldots, p_T \rangle$, reporting the
concatenation of their simplified trajectories as the simplified
trajectory $o'$.


Figure 6: Trajectory Simplification 

Figure 6(a) illustrates the application of DP on the trajectories
in Figure 4. For $o_1$'s trajectory, we first construct the virtual line
between its end points. Since the distance between the farthest point
(i.e., $p_1$) and the virtual line exceeds $\delta$, point $p_1$ will be kept in $o_1$'s
corresponding simplified trajectory $o'_1$. Regarding $o_2$, the distance
of the farthest point (i.e., $p_2$) from the virtual line is below $\delta$; thus,
all intermediate points are removed from $o_2$'s simplified trajectory.
Figure 6(b) visualizes the simplified trajectories. Notice that each
point in a simplified trajectory corresponds to a point in the original 
trajectory and is associated with a time value. 

Measuring actual tolerances of simplified trajectories: We observe
that an actual tolerance smaller than $\delta$ may exist so that the
simplified trajectory is valid. In the example of Figure 6(a), the
actual tolerance of $o'_2$ is determined by the distance between $p_2$ and
the virtual line. We formally define the actual tolerance as follows:

DEFINITION 4. (Actual Tolerance) Let $l'$ be a line segment in
the simplified trajectory $o'$, whose original trajectory is $o$. The
actual tolerance $\delta(l')$ of $l'$ is defined as $\max_{t \in l'.\tau} D_{PL}(o(t), l')$.
The actual tolerance $\delta(o')$ of $o'$ is defined as the maximum $\delta(l')$
value over all its line segments.
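
Definition 4 translates directly into a short computation. The Python sketch below is an illustration under our own assumptions: both trajectories are lists of `(x, y, t)` tuples sorted by time, the simplified trajectory has at least two points and consists of points of the original one, and the function names are ours.

```python
import math


def d_pl(p, a, b):
    """Shortest distance from point p to the segment from a to b (all 2-D tuples)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    u = max(0.0, min(1.0, ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq))
    return math.hypot(p[0] - (a[0] + u * dx), p[1] - (a[1] + u * dy))


def actual_tolerances(original, simplified):
    """Return ([delta(l') for each segment l'], delta(o')) for a simplified trajectory."""
    per_segment = []
    for (ax, ay, ta), (bx, by, tb) in zip(simplified, simplified[1:]):
        # Original locations whose time falls into the segment's time interval.
        covered = [p for p in original if ta <= p[2] <= tb]
        per_segment.append(max(d_pl((x, y), (ax, ay), (bx, by)) for x, y, _ in covered))
    return per_segment, max(per_segment)


original = [(0, 0, 1), (1, 0.5, 2), (2, 0, 3), (3, 2, 4), (4, 0, 5)]
simplified = [(0, 0, 1), (2, 0, 3), (4, 0, 5)]
print(actual_tolerances(original, simplified))
```

In this hypothetical example the first segment has a much smaller actual tolerance than the second, which is the kind of slack that tighter, per-segment distance bounds can exploit.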



The actual tolerance of each line segment $l'$ of $o'$ can be
computed easily by examining the locations of $o$ during the corresponding
time interval $l'.\tau$. In addition, the derivation of these tolerance
values can be seamlessly integrated into the DP algorithm so that
the original trajectory $o$ need not be examined again.

The actual tolerances are valuable in the sense that they can be 
exploited to tighten the distance computation for simplified trajec- 
tories, as we will show in the next section. 

5.2 Distance Bounds for Range Search 

A simplified trajectory $o'$ may omit many locations
in comparison to its original trajectory $o$. Thus, it is not possible
to perform (density-connected) clustering at individual time points. If we
generated virtual positions for the omitted points as done in CMC,
the trajectory simplification would serve no purpose. The main
challenge becomes one of performing clustering on the line segments
of simplified trajectories so that each snapshot cluster (on the
original trajectories) is captured by a cluster of line segments (from the
simplified trajectories).

In density-based clustering techniques (e.g., DBSCAN), the core
operation is $e$-neighborhood search, i.e., to find objects within
distance $e$ of a given object, at a fixed time $t$. We proceed to develop
the implementation of this core operation in the context of line
segments. Let a line segment $l'_q$ be given; our goal is then to retrieve
all line segments $l'_i$ whose original trajectory $o_i$ can possibly satisfy
the condition $D(o_q(t), o_i(t)) \leq e$ for some time point $t$. This way,
all qualifying convoy candidates are guaranteed to be found in the
filter step.

Let $o'_q$ and $o'_i$ be simplified trajectories of the original
trajectories $o_q$ and $o_i$. At a given time $t$, the locations of
$o_q$ and $o_i$ are $o_q(t)$ and $o_i(t)$. Observe that the endpoints
of line segments in $o'_q$ are timestamped. Let $l'_q$ be a line
segment in $o'_q$ such that its time interval $l'_q.\tau$ covers $t$.
Similarly, we use $l'_i$ to denote the line segment in $o'_i$
satisfying $t \in l'_i.\tau$. Figure 7 shows an example of two line
segments $l'_q$ and $l'_i$.
Figure 8: Range Search with Error Bounds

segments. To obtain better performance, we aim to quickly prune a
subset $S$ of line segments during the range search of the given line
segment $l'_q$. Lemma 2, given next, enables us to prune a
non-qualifying $S$ before examining its line segments. The proofs of
Lemma 1 and Lemma 2 are provided in the appendix.

LEMMA 2. Let $S$ be a subset of line segments $l'_i$ (from simplified
trajectories). Let $B(S)$ be the minimum bounding box of all
segments in $S$, $S.\tau = \bigcup_{l'_i \in S} l'_i.\tau$, and
$\delta_{max}(S) = \max_{l'_i \in S} \delta(l'_i)$.
Let line segment $l'_q$ have a time interval that intersects with that
of $S$, i.e., $S.\tau \cap l'_q.\tau \neq \emptyset$.

If $D_{min}(B(l'_q), B(S)) > e + \delta(l'_q) + \delta_{max}(S)$ then
$D(o_q(t), o_i(t)) > e$ holds for all $l'_i \in S$.

We proceed to outline how to perform range search for $l'_q$ in
multiple steps by gradually tightening the condition. First, we
retrieve a set of line segments $S$ whose time intervals overlap with
that of $l'_q$. We then apply Lemma 2 to prune non-qualifying line
segments in $S$ at an early stage. Next, for each remaining line
segment in $S$, we discard non-qualifying line segments by applying
Lemma 1. Any surviving line segment is included in the
$e$-neighborhood of the line segment $l'_q$. Using this multi-step
range search for line segments, we are able to perform
density-connected clustering of line segments efficiently.
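The bounding-box test of Lemma 2 can be sketched in Python as follows. This is only an illustration: the function names and the tuple-based segment representation are assumptions, and the caller is assumed to have already checked that the time interval of the query segment intersects that of the subset.

```python
import math

def bbox(segments):
    """Minimum bounding box (xmin, ymin, xmax, ymax) of 2D segments,
    each given as ((x1, y1), (x2, y2))."""
    xs = [p[0] for seg in segments for p in seg]
    ys = [p[1] for seg in segments for p in seg]
    return min(xs), min(ys), max(xs), max(ys)

def box_min_distance(b1, b2):
    """Smallest distance between two axis-aligned boxes (0 if they overlap)."""
    dx = max(b1[0] - b2[2], b2[0] - b1[2], 0.0)
    dy = max(b1[1] - b2[3], b2[1] - b1[3], 0.0)
    return math.hypot(dx, dy)

def lemma2_prunes(query_seg, query_tol, subset_segs, subset_tols, e):
    """True if the whole subset can be discarded for this query segment.
    query_tol and subset_tols are the actual tolerances of the segments."""
    d_min = box_min_distance(bbox([query_seg]), bbox(subset_segs))
    return d_min > e + query_tol + max(subset_tols)
```

Only one box-to-box distance is evaluated per subset, which is why this test is applied before the per-segment check of Lemma 1.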
Figure 7: Trajectory Segments with Time Intervals Covering t

Lemma 1 establishes the relationship between distances in the
original trajectories and those in the simplified trajectories.

LEMMA 1. Let $o'_q$ ($o'_i$) be the simplified trajectory of original
trajectory $o_q$ ($o_i$). Given a time $t$, let $l'_q$ ($l'_i$) be the
line segment in $o'_q$ ($o'_i$) with a time interval that covers $t$.

If $D_{LL}(l'_q, l'_i) > e + \delta(l'_q) + \delta(l'_i)$ then
$D(o_q(t), o_i(t)) > e$.

Lemma 1 allows us to prune line segments $l'_i$ during the range
search of the given line segment $l'_q$. Figure 8 illustrates the
extended range for search over simplified line segments with error
bounds. In Figure 8(a), half of the points on the original trajectory
are omitted (i.e., a 50% reduction) with the given $\delta$ value. To
enable correct discovery processing over the simplified trajectories
(dotted lines), we enlarge the search space as shown in Figure 8(b).
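The per-segment test of Lemma 1 can be sketched as below. The sketch assumes that $D_{LL}$ is the shortest Euclidean distance between two 2D line segments; the helper names are hypothetical and not taken from the paper.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (2D tuples)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    s = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    s = max(0.0, min(1.0, s))
    return math.hypot(p[0] - (a[0] + s * dx), p[1] - (a[1] + s * dy))

def _orient(a, b, c):
    """Sign of the turn a -> b -> c (cross product)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segment_distance(s1, s2):
    """Shortest distance between two 2D segments."""
    (a, b), (c, d) = s1, s2
    # Proper crossing: the endpoints of each segment lie on opposite sides
    # of the other. Touching and collinear cases are covered below, because
    # then some endpoint has distance 0 to the other segment.
    if _orient(a, b, c) * _orient(a, b, d) < 0 and \
       _orient(c, d, a) * _orient(c, d, b) < 0:
        return 0.0
    return min(point_segment_distance(a, c, d), point_segment_distance(b, c, d),
               point_segment_distance(c, a, b), point_segment_distance(d, a, b))

def lemma1_prunes(seg_q, tol_q, seg_i, tol_i, e):
    """True if seg_i cannot be in the e-neighborhood of seg_q."""
    return segment_distance(seg_q, seg_i) > e + tol_q + tol_i
```

A segment that is not pruned is only a candidate neighbor; the refinement step still has to check the original trajectories.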

Notice that we still need to scan all $l'_i$ whose time intervals
intersect with that of $l'_q$. For example, the time interval
$[t_5, t_7]$ of the second line segment of $o'_q$ in Figure 9(a)
intersects all of $o'_i$'s line



Figure 9: Measure of $\omega(o'_q, o'_i)$ and Time Partitioning

Extension for trajectories : So far, we have addressed range search 
only for line segments. In fact, it is feasible to generalize the search 
to apply to an entire trajectory. And by applying clustering on tra- 
jectories directly, we further reduce the cost of the filter step. As 
we will see in the next section, the technique below is applicable 
to sub-trajectories as well, enabling us to control the granularity of 
the filter step. 

We aim to retrieve all simplified trajectories $o'_i$ whose original
trajectories $o_i$ possibly satisfy the condition
$D(o_q(t), o_i(t)) \le e$ for some time $t$. In case $o'_q$ and $o'_i$
have disjoint time intervals (i.e., $o'_q.\tau \cap o'_i.\tau =
\emptyset$), they cannot belong to the same convoy. Otherwise, we
define their $\omega$ value as follows:

$$\omega(o'_q, o'_i) = \min\{ D_{LL}(l'_q, l'_i) - \delta(l'_q) - \delta(l'_i) \mid l'_i \in o'_i,\ l'_q \in o'_q,\ l'_q.\tau \cap l'_i.\tau \neq \emptyset \}$$

Figure 9(a) shows an example of computing the $\omega(o'_q, o'_i)$
value between two simplified trajectories $o'_q$ and $o'_i$. Line
segments with a shared time interval are linked by dotted lines, each
contributing a term in the value of $\omega(o'_q, o'_i)$. If
$\omega(o'_q, o'_i) > e$, no time $t$ exists such that
$D(o_q(t), o_i(t)) \le e$. Otherwise, their locations in the original
trajectories may be within distance $e$ for some time $t$.
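The $\omega$ measure can be sketched in Python as follows. The tuple layout of a segment and the `seg_dist` callable standing in for $D_{LL}$ are assumptions for illustration; returning infinity for trajectories with no time-overlapping segments is likewise a choice made here so that such pairs always fail the test $\omega \le e$.

```python
import math

def omega(traj_q, traj_i, seg_dist):
    """traj_q, traj_i: lists of segments (t_start, t_end, tolerance, geometry).
    seg_dist: function returning the D_LL distance between two geometries.
    Returns the minimum of D_LL - tolerance_q - tolerance_i over all pairs
    of segments whose (closed) time intervals intersect."""
    best = math.inf
    for q_start, q_end, q_tol, q_geom in traj_q:
        for i_start, i_end, i_tol, i_geom in traj_i:
            if q_start <= i_end and i_start <= q_end:
                best = min(best, seg_dist(q_geom, i_geom) - q_tol - i_tol)
    return best

def may_be_neighbors(traj_q, traj_i, seg_dist, e):
    """False only when no time t can have the originals within distance e."""
    return omega(traj_q, traj_i, seg_dist) <= e
```

In a usage example, the geometry can be anything `seg_dist` understands; a toy choice is a single number per segment with `seg_dist = lambda a, b: abs(a - b)`.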



5.3 The CuTS Algorithm 

We first present a general overview of the CuTS (Convoy Dis- 
covery using Trajectory Simplification) algorithm, then illustrate 
aspects of the algorithm with examples, and finally present the de- 
tails of the algorithm. 

In the filter step, we first apply simplification (with tolerance
$\delta$) to the original trajectories in order to obtain their
simplified trajectories. We then partition the time domain (with each
partition covering $\lambda$ time points) and assign the line segments
of each $o'_i$ to qualifying partitions. Next, we perform clustering
on those line segments. Clusters across adjacent partitions with
common objects are used to form convoy candidates. In the refinement
step, we perform clustering of the original trajectories of the
objects in each convoy candidate. The total computational cost of the
CuTS algorithm is the sum of the simplification, the clustering, and
the refinement costs. Our experiments in Section 7 suggest that the
simplification and refinement costs are very low in practice.

To understand the filter step of CuTS better, consider Figure 9(b),
where the time domain is divided into equal-length ($\lambda = 4$)
partitions $T_1$ and $T_2$ with time intervals $[t_1, t_4]$ and
$[t_4, t_7]$, respectively. The time partition $T_1$ contains two line
segments of $o'_1$, one line segment of $o'_2$, and two line segments
of $o'_3$. Note that a line segment whose time interval intersects
both partitions is inserted into both $T_1$ and $T_2$ to avoid any
possible false dismissal when we compute the value of
$\omega(o'_q, o'_i)$ in Figure 9(a).

Algorithm description. Algorithm 2 presents the pseudocode of
CuTS's filter step. In addition to the convoy query parameters $m$,
$k$, and $e$, two internal parameters $\delta$ (tolerance for
trajectory simplification) and $\lambda$ (the length of each
partition) also need to be specified. Those parameter values are
relevant to the performance only (e.g., execution time) and do not
affect the correctness. Guidelines for choosing their values will be
presented in Section 7.4.

Algorithm 2 CuTS_Filter (Object set O, Integer m, Integer k,
Distance threshold e)

 1: δ ← ComputeDelta(O, e)
 2: for each trajectory o_i ∈ O do
 3:     o'_i ← Douglas-Peucker(o_i, δ)
 4: λ ← ComputeLambda(O, k, (Σ_i |o_i|) / (Σ_i |o'_i|))
 5: V ← ∅; V_cand ← ∅
 6: divide the time domain into λ-length disjoint partitions
 7: for each time partition T_z (in ascending order) do
 8:     V_next ← ∅
 9:     for each o'_i satisfying o'_i.τ ∩ T_z.τ ≠ ∅ do
10:         insert l' ∈ o'_i (intersecting time interval of T_z) into Q
11:     C ← TRAJ-DBSCAN(Q, e, m)
12:     for each convoy candidate v ∈ V do
13:         v.assigned ← false
14:         for each cluster c ∈ C do
15:             if |c ∩ v| ≥ m then
16:                 v.assigned ← true
17:                 v' ← c ∩ v
18:                 v'.lifetime ← v.lifetime + λ
19:                 V_next ← V_next ∪ v'
20:                 c.assigned ← true
21:         if v.assigned = false and v.lifetime ≥ k then
22:             V_cand ← V_cand ∪ v
23:     for each c ∈ C do
24:         if c.assigned = false then
25:             c.lifetime ← λ
26:             V_next ← V_next ∪ c
27:     V ← V_next
28: return V_cand



Lines 2-3 of the algorithm perform trajectory simplification for
all objects. Next, the time domain is partitioned, each partition
holding $\lambda$ consecutive time points. Time partitions are then
processed iteratively in ascending order of their time. Let the
current loop consider the time partition $T_z$. The algorithm builds
a polyline (i.e., a sequence of line segments) from a simplified
trajectory $o'_i$, which contains the line segments of $o'_i$ whose
time intervals intersect that of $T_z$. It then stores all the
polylines from each simplified trajectory into a data structure $Q$.
Next, density clustering is performed for the sub-trajectories in $Q$
(see Line 11).

The set $V$ keeps track of the convoy candidates found in previous
iterations, whereas the set $V_{next}$ stores new candidates found
in the current iteration. In Lines 12-20, each cluster $c \in C$ (found
in the current iteration) is joined with those in $V$, as long as
their intersections have at least $m$ objects. Also, candidate convoys
that cannot be extended and have a lifetime of at least $k$ are
inserted into the candidate set $V_{cand}$. Clusters that cannot join
with previous convoy candidates are then considered as new candidates
(Lines 23-26).
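The candidate bookkeeping of Lines 12-27 can be sketched in Python as follows, assuming the per-partition clusters have already been computed and are given as sets of object identifiers. The function name and data layout are hypothetical. The final flush of candidates that are still alive after the last partition is an addition made in this sketch and does not appear in the printed pseudocode.

```python
def track_candidates(clusters_per_partition, m, k, lam):
    """clusters_per_partition: list (in ascending time order) of lists of
    sets of object ids, one inner list per time partition.
    Returns the convoy candidates as (frozenset of objects, lifetime)."""
    V = []        # candidates carried over from previous partitions
    V_cand = []   # candidates handed to the refinement step
    for clusters in clusters_per_partition:
        V_next = []
        cluster_assigned = [False] * len(clusters)
        for v_objs, v_life in V:
            v_assigned = False
            for idx, c in enumerate(clusters):
                common = v_objs & c
                if len(common) >= m:          # Line 15
                    v_assigned = True
                    V_next.append((common, v_life + lam))
                    cluster_assigned[idx] = True
            if not v_assigned and v_life >= k:  # Line 21
                V_cand.append((v_objs, v_life))
        for idx, c in enumerate(clusters):      # Lines 23-26
            if not cluster_assigned[idx]:
                V_next.append((frozenset(c), lam))
        V = V_next
    # Assumption of this sketch: also report candidates that are still
    # alive when the last partition has been processed.
    V_cand.extend((objs, life) for objs, life in V if life >= k)
    return V_cand
```

With `m = 2`, `k = 8`, and `lam = 4`, the input `[[{1, 2, 3}, {7, 8}], [{1, 2, 3, 4}], [{5, 6}]]` produces the single candidate `{1, 2, 3}` with lifetime 8.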

Finally, Algorithm 3 contains the pseudocode of the refinement
step of the CuTS algorithm. Suppose that $v$ is the convoy candidate
in the candidate set $V_{cand}$ that is currently being examined. We
first determine the time interval $[t_{start}, t_{end}]$ for $v$ and
then identify the set $O'$ of the original trajectories whose line
segments appear in $v$. Finally, we apply CMC for trajectories in
$O'$, considering only time points in the interval
$[t_{start}, t_{end}]$.

Algorithm 3 CuTS_Refinement (Candidate set V_cand, Object
set O, Integer m, Integer k, Distance threshold e)

for each v ∈ V_cand do
    t_start ← start time of v
    t_end ← end time of v
    O' ← {o_i ∈ O | l'_i ∈ v ∧ l'_i ∈ o'_i}
    call CMC(O', m, k, e) with the time interval [t_start, t_end]



6. EXTENSIONS OF CUTS 

In this section, we introduce two enhancements of CuTS. One 
accelerates the process of trajectory simplification and brings higher 
efficiency. The other shortens the search range for clustering by 
considering temporal information of trajectories, reducing the num- 
ber of candidates after the filter step of CuTS. 

6.1 Faster Trajectory Simplification - CuTS+ 

The Douglas-Peucker algorithm (DP) utilizes the divide-and-conquer
technique (see Section 2.2). It is well known that techniques built
on the divide-and-conquer paradigm show the best performance if a
given input is divided into two equally sized sub-inputs in each
division step. Inspired by this, we modify the original DP algorithm
to speed up the simplification process, obtaining DP+.

Specifically, DP+ selects the closest point to the middle of a
given trajectory among the points exceeding tolerance value $\delta$
at each approximation step. Figure 10(a) demonstrates an original
trajectory having seven points, which has two intermediate points
$p_4$ and $p_6$ whose distances from $\overline{p_1 p_7}$ are greater
than the given $\delta$ value (the gray area in the figure). The DP
method selects the point having the largest distance (i.e., $p_6$);
hence, the result of this division step will be as shown in
Figure 10(b).

In contrast, our DP+ method picks the point $p_4$ that is the closest
to the middle point of $p_1, p_2, \ldots, p_7$ among intermediate
points exceeding $\delta$ (i.e., $p_4$ and $p_6$). This technique
divides $\overline{p_1 p_7}$ into two sub-trajectories
$\overline{p_1 p_4}$ and $\overline{p_4 p_7}$, which have similar
numbers of points (Figure 10(c)). Therefore, the whole process of
trajectory simplification is expected to be more efficient.
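The difference between the two split rules can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: the function names are hypothetical, the "middle" of a trajectory is taken to be the middle index, and distances are measured from a point to the line segment joining the two current endpoints.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (2D tuples)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    s = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    s = max(0.0, min(1.0, s))
    return math.hypot(p[0] - (a[0] + s * dx), p[1] - (a[1] + s * dy))

def simplify(points, delta, strategy="dp"):
    """Returns the indices of the points kept by DP ("dp") or DP+ ("dp+")."""
    def recurse(lo, hi):
        dists = [(n, point_segment_distance(points[n], points[lo], points[hi]))
                 for n in range(lo + 1, hi)]
        exceeding = [(n, d) for n, d in dists if d > delta]
        if not exceeding:
            return [lo, hi]
        if strategy == "dp":
            # DP: split at the point farthest from the virtual line.
            split = max(exceeding, key=lambda nd: nd[1])[0]
        else:
            # DP+: split at the exceeding point closest to the middle index.
            middle = (lo + hi) / 2
            split = min(exceeding, key=lambda nd: abs(nd[0] - middle))[0]
        return recurse(lo, split) + recurse(split, hi)[1:]
    return recurse(0, len(points) - 1)

# Seven points loosely modeled on the situation described for Figure 10
# (coordinates are made up): only p4 and p6 exceed delta = 1.0.
pts = [(0, 0), (1, 0.4), (2, 0.8), (3, 1.1), (4, 0.9), (5, 1.6), (6, 0)]
dp_kept = simplify(pts, 1.0, "dp")        # indices of p1, p6, p7
dp_plus_kept = simplify(pts, 1.0, "dp+")  # indices of p1, p4, p6, p7
```

On this made-up input, DP keeps three points while DP+ keeps four, which mirrors the trade-off between balanced division and reduction power.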




Figure 10: Comparison between DP and DP+ 

Compared with DP, DP+ may have lower simplification power.
In fact, each division process of DP+ does not preserve the shape
of a given original trajectory well; hence, the next division process
may not be as effective as that of DP. For example, in Figure 10(c),
$p_6$ will be kept using DP+ because
$D_{PL}(p_6, \overline{p_4 p_7}) > \delta$, and then the simplified
trajectory will be $p_1, p_4, p_6, p_7$, whereas $p_1, p_6, p_7$
will be the result of DP in Figure 10(b).

In spite of the lower reduction, DP+ can enhance the discovery
processing of CuTS in two ways. First, note that we are interested
in efficient discovery of convoys in this study. As long as the search
distances are bounded, faster simplification of trajectories can play
a more important role in finding convoys. Second, the actual
tolerances obtained by DP+ are always smaller than or equal to those
obtained by DP (e.g., $\delta_4 < \delta_6$ in the example). This
tightens the error bounds of range search for clustering, leading to
a more effective filter step.

We extend CuTS to CuTS+, which is built on the DP+ simpli- 
fication method. All other discovery processes of CuTS+ are the 
same as those of CuTS. 

6.2 Temporal Extension - CuTS* 

Recall that CuTS applies trajectory simplification (DP) on orig- 
inal trajectories in the filter step. However, as we will see shortly, 
intermediate locations on simplified line segments cannot be asso- 
ciated with fixed timestamps. Consequently, the bounds on dis- 
tances between line segments may not be tight, the result being that 
overly many convoy candidates can be produced in the filter step. 
This may yield a more expensive refinement step. 

In this section, we extend CuTS to CuTS* by considering temporal
aspects for both the trajectory simplification and the distance
measure on simplified trajectories. This enables us to tighten
distance bounds between simplified trajectories, improving the
effectiveness of the filter step.

Comparison between DP and DP*: We discussed the differences
between the two trajectory simplification techniques DP [11] and
DP* [23] in Section 2.2. In Figure 3(b), DP* translates the time
ratio of $p_2$ between $p_1$ and $p_3$ into a location $p'_2$ on the
line segment $\overline{p_1 p_3}$. Since $p_2$ exceeds the $\delta$
range of $p'_2$, the point $p_2$ is kept in the simplified trajectory
$o'_1$, which is different from DP.

From the example, we can see that DP* has a lower vertex reduc- 
tion ratio for trajectories. Nevertheless, DP* permits us to derive 
tighter distance measures between trajectory segments, improving 
the overall effectiveness of the filter step. 




Figure 11: Different Distance Measures of Trajectory Segments 



Figure 11(a) shows two simplified line segments $l'_1$ and $l'_2$,
obtained from DP. Here, the endpoints of $l'_1$ correspond to its
locations at times $t_1$ and $t_4$. Similarly, the endpoints of $l'_2$
correspond to its locations at times $t_3$ and $t_5$. The shortest
distance between $l'_1$ and $l'_2$ is given by $D_{LL}(l'_1, l'_2)$.

Figure 11(b) contains simplified line segments from DP*. Since
DP* captures the time ratio in the simplified line segment, we are
able to derive the locations on $l'_1$ and on $l'_2$ at a given time.
Let $l'_p = \{p_u, p_v\}$ be a simplified line segment having a time
interval $l'_p.\tau = [u, v]$. The location of $l'_p$ at a time
$t \in [u, v]$ is defined as:

$$l'_p(t) = p_u + \frac{t - u}{v - u}\,(p_v - p_u)$$

Note that the terms $l'_p(t)$, $p_u$, and $(p_v - p_u)$ are 2D vectors
representing locations.

Before defining $D_*(l'_1, l'_2)$ formally, we need to introduce the
time of the Closest Point of Approach, called the CPA time
($t_{CPA}$) [6]. This is the time when the distance between two
dynamic objects is the shortest, considering their velocities. Let
$l'_q = \{q_w, q_x\}$ be another simplified line segment during
$l'_q.\tau = [w, x]$. The CPA time of $l'_p$ and $l'_q$ is computed by:

$$t_{CPA} = \frac{-(p_u - q_w) \cdot (l'_p(t) - l'_q(t))}{|l'_p(t) - l'_q(t)|^2}$$

where $l'_q(t)$, $q_w$, and $(q_x - q_w)$ are also location vectors.

Observe that the common interval of $l'_1$ and $l'_2$ is
$[t_3, t_4]$ (gray area in Figure 11(b)). The tightened shortest
distance $D_*(l'_1, l'_2)$ between them is computed as:

$$D_*(l'_1, l'_2) = D(l'_1(t_{CPA}), l'_2(t_{CPA})), \quad t_{CPA} \in (l'_1.\tau \cap l'_2.\tau)$$

When their time intervals do not intersect, i.e.,
$l'_1.\tau \cap l'_2.\tau = \emptyset$, their distance is set to
$\infty$.
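A minimal Python sketch of the time-aware distance $D_*$ is given below. It uses the standard closest-point-of-approach computation for two objects moving at constant velocity, measured from the start of the common time interval, and clamps the CPA time to that interval. Both choices are assumptions about how the expressions above are meant to be evaluated, and the segment layout `(t_start, p_start, t_end, p_end)` is hypothetical; segments are assumed to have non-zero duration.

```python
import math

def position(seg, t):
    """Time-ratio location on seg = (u, p_u, v, p_v) at time t in [u, v]."""
    u, pu, v, pv = seg
    r = (t - u) / (v - u)
    return (pu[0] + r * (pv[0] - pu[0]), pu[1] + r * (pv[1] - pu[1]))

def velocity(seg):
    """Constant velocity vector implied by the segment."""
    u, pu, v, pv = seg
    return ((pv[0] - pu[0]) / (v - u), (pv[1] - pu[1]) / (v - u))

def d_star(seg_p, seg_q):
    """Distance between the two moving points at their CPA time, restricted
    to the common time interval; infinity if the intervals are disjoint."""
    lo, hi = max(seg_p[0], seg_q[0]), min(seg_p[2], seg_q[2])
    if lo > hi:
        return math.inf
    p0, q0 = position(seg_p, lo), position(seg_q, lo)
    vp, vq = velocity(seg_p), velocity(seg_q)
    wx, wy = p0[0] - q0[0], p0[1] - q0[1]        # relative position at lo
    dvx, dvy = vp[0] - vq[0], vp[1] - vq[1]      # relative velocity
    dv_sq = dvx * dvx + dvy * dvy
    # Same velocity: the distance is constant, so any time in [lo, hi] works.
    t_cpa = lo if dv_sq == 0.0 else lo - (wx * dvx + wy * dvy) / dv_sq
    t_cpa = max(lo, min(hi, t_cpa))              # keep it in the common interval
    a, b = position(seg_p, t_cpa), position(seg_q, t_cpa)
    return math.hypot(a[0] - b[0], a[1] - b[1])
```

For example, with `seg_p = (1, (0, 0), 4, (3, 0))` and `seg_q = (3, (0, 2), 5, (2, 2))`, the two segments are only 2 apart as static geometry, but `d_star` gives about 2.83 because the objects are never at the closest pair of positions at the same time.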

Clearly, $D_*(l'_1, l'_2)$ is at least as long as
$D_{LL}(l'_1, l'_2)$, and it is longer in this example; hence, the line
segments in Figure 11(b) have a lower probability of forming a
cluster together than do those in Figure 11(a). These tightened
distance bounds improve the effectiveness of the filter step.

Distance bounds for DP* simplified line segments: Using the
notation from Lemma 1, we derive the counterpart that uses the
tightened distance $D_*$ between line segments (as opposed to the
distance $D_{LL}$). Lemma 3 establishes the relationship between
distances in original trajectories and those in simplified
trajectories (obtained by DP*). The proof is provided in the appendix.

LEMMA 3. Suppose that $o'_q$ ($o'_i$) is the simplified trajectory
(from DP*) of the original trajectory $o_q$ ($o_i$). Given a time $t$,
let $l'_q$ ($l'_i$) be the line segment in $o'_q$ ($o'_i$) with time
interval covering $t$.

If $D_*(l'_q, l'_i) > e + \delta(l'_q) + \delta(l'_i)$ then
$D(o_q(t), o_i(t)) > e$.

CuTS* algorithm for convoy discovery: We develop an enhanced
algorithm, called CuTS*, to exploit the above tightened distance
bounds for query processing. Two components of CuTS need to be
replaced. First, CuTS* applies DP* for the trajectory simplification.
Second, during density clustering in the filter step, Lemma 3
is utilized in the range search operations (as opposed to Lemma 1).
The above modifications improve the effectiveness of the filter step
in CuTS*. The following table summarizes the key components of
CuTS and its extensions.
7. EXPERIMENTS

In this experimental study, we first compare the discovery
efficiency of CMC, which is an adaptation of a moving-clustering
algorithm (MC2) [19] for our convoy discovery problem, and the
CuTS family (CuTS, CuTS+, and CuTS*). We then analyze the
performance of each method of the CuTS family while varying the
settings of their key parameters.

We implemented the above algorithms in the C++ language on 
a Windows Server 2003 operating system. The experiments were 
performed using an Intel Xeon CPU 2.50 GHz system with 16GB 
of main memory. 

7.1 Dataset and Parameter Setting 

For studying the performance of our methods in a real-world set- 
ting, we used several real datasets that were obtained from vehicles 
and animals. Due to the different object types, their trajectories 
have distinct characteristics, such as the frequency of location sam- 
pling and data distributions. The details of each dataset are de- 
scribed as follows: 

Truck: We obtained 276 trajectories of 50 trucks moving in the
Athens metropolitan area in Greece [2]. The trucks were carrying
concrete to several construction sites for 33 days while their loca- 
tions were measured. To be able to find more convoys, we regarded 
each trajectory as a distinct truck's trajectory and removed the day 
information from the data. Thus, the dataset became 276 trucks' 
movements on the same day. 

Cattle: To reduce a major cost for cattle producers, a virtual fenc- 
ing project in CSIRO, Australia studied managing herds of cattle 
with virtual boundaries. We obtained 13 cattle's movements for 
several hours from the project. Their locations were provided by 
GPS-enabled ear-tags every second. A distinguishable aspect of 
this dataset is its very large number of timestamps. 

Car: Normal travel patterns of over 500 private cars were analyzed
for building reasonable road pricing schemes in Copenhagen,
Denmark. We obtained 183 cars' trajectories during one week [3].
Trajectories in this dataset had very different lengths.

Taxi: The GPS logs of 500 taxis in Beijing, China were recorded
during a day and studied at the Institute of Software, Chinese
Academy of Sciences. The locations of the trajectories were sampled
irregularly. For example, some taxis reported their locations every
three minutes, while others did so once in several minutes.

In our experiments, we defined a convoy as containing at least
3 objects (except Cattle due to the small number of objects) that
travel closely for 3 minutes (i.e., $m = 3$ and $k = 180$). We also
adjusted the values of neighborhood range $e$ to be able to find 1 to
100 convoys for each dataset. To perform convoy discovery using
our main methods (CuTS, CuTS+, and CuTS*), we still need to
determine two key parameters, namely the tolerance value ($\delta$)
for trajectory simplification and the length of time partition
($\lambda$). These parameter values were computed by our guidelines
that will be discussed in Section 7.4.

Table 3 provides (i) detailed information of each dataset, (ii) the
settings of the parameters to be used throughout our experiments,
and (iii) the number of convoys discovered by our proposed methods
with the parameters.

7.2 CMC vs. The CuTS Family 

First, we compared the efficiency of CMC versus the CuTS family.
Over all the datasets, the CuTS family was between 3.9 and 33.1 times
faster than CMC, as seen in Figure 12, and CuTS* in particular had
the highest efficiency. The performance differences were more obvious
in the Car and the Taxi datasets, although their data sizes (total
number of points) were less than 10% of Cattle's data size. Since
those two datasets had many missing points and trajectories with
different lifetimes, CMC incurred extra computational cost to create
virtual points for the missing times in order to measure
density-connection correctly (see Section 4). This also caused a
considerable growth of the actual data size for the discovery
processing. Notice that our main methods, the CuTS family, can
perform the discovery without any extra processing regardless of the
number of missing points.

Table 3: Settings for Experiments
Figure 12: Comparisons of Query Processing Time

In Figure 13, we report on the elapsed times of each method of
the CuTS family for the Cattle and Taxi datasets (magnified views
of the results in Figure 12). For brevity, we show the two most
distinctive results only. In the results for the Cattle dataset, the
simplification cost dominates for all the methods. In general, convoy
processing is more sensitive to the number of objects $N$ than to the
number of timestamps $T$ since the clustering method (DBSCAN)
has $O(N^2)$ computational cost ($O(N \log N)$ with a spatial
index). The Cattle dataset has only 13 objects, and the cost of each
clustering is very low, though it is performed $T$ times. As a result,
the total discovery times are more influenced by the simplification
process than by the filtering and refinement steps.
Figure 13: Analysis of Query Processing Cost



The reason why each simplification method has different efficiency
will be studied in Section 7.3. Recall that, although the
CuTS family needs much time for trajectory simplification on the
Cattle dataset, their total discovery times are still much lower than
those of CMC in the previous experiments.

Another interesting observation found with the Cattle data is
that CuTS+ has not only faster trajectory simplification, but also
lower refinement cost. This is because DP+ as used in CuTS+ has
not only higher efficiency of simplification, but also tighter error
bounds than DP as used by CuTS, as described in Section 6.1.

Compared to the Cattle data, trajectory simplification had very
low computational cost on the Taxi dataset. As the Taxi dataset has
a short $T$ but a larger $N$, the clustering cost dominates the
discovery time. In addition, since the number of convoy candidates
was small for this data (as will be shown in the next experiments),
only little refinement was necessary.

For the other two datasets, the composition of computational 
time was about 70%-80% for filtering (around 5%-15% for trajec- 
tory simplification) and 20%-30% for refinement. Therefore, it is 
very reasonable to 'invest' some time in trajectory simplification. 

We also studied the effect of using the actual tolerance for the
range search of clustering. When we perform the trajectory
simplification, we use the tolerance value $\delta$, named the global
tolerance here. The key process of the simplification is to remove
intermediate points whose distances from the virtual line linking two
end points of the original trajectory do not exceed $\delta$. Any
distance of those removed points (i.e., actual tolerance) is always
smaller than or equal to the global tolerance (see Section 5.1). The
actual tolerance is useful for range search since it reduces the
search area.
Figure 14: Effect of Actual Tolerance

Figure 14(a) demonstrates the filtering power of the global and
actual tolerances for CuTS*. We omit the results for CuTS and
CuTS+ because they are similar. As shown, the number of candidates
after the filtering step decreases considerably when we use
the actual tolerance. The advantage of the improved filtering by the
actual tolerance is reflected in the efficiency of convoy discovery
as shown in Figure 14(b). Yet, the effect is relatively small on the
Truck and the Taxi datasets. This is because some candidates that
do not need much computation for the refinement step are pruned
when using the actual tolerance. We present a more precise way of
measuring the filter's effectiveness in the following section.

7.3 CuTS vs. CuTS+ vs. CuTS* 

We have already discussed different techniques for trajectory
simplification. The difference between the original Douglas-Peucker
algorithm (DP) and its temporal extension DP* was covered in
Section 2.2. We also developed a DP variant, named DP+, in
Section 6.1. It is of interest to compare the performance of those
methods.

Figure 15(a) illustrates the differences of their reduction power
for the Cattle dataset. We skip the results for the other datasets
because they show similar trends. With the same values of tolerance,
DP shows higher reduction rates than does DP*. This is natural
since DP* uses the time-ratio distance to approximate points,
which is always equal to or greater than the perpendicular distance
of DP (see Section 6.2). Furthermore, the vertex reduction of DP+
is lower than that of DP. This is because DP+ does not preserve the
shapes of the original trajectories well when compared to DP. This
aspect was explained in Section 6.1.
Figure 15: Comparison of Trajectory Simplification Methods

In Figure 15(b), DP+ exhibits the fastest elapsed time among the
methods because of its more effective division process. An
interesting observation of the figure is that the efficiency of all
the methods grows as reduction ratios increase. Recall that all the
methods utilize the divide-and-conquer paradigm, which divides an
input trajectory until no point exceeds a given $\delta$. With a
larger value of $\delta$, their division processes are likely to
terminate sooner. For the same reason, DP*, which has lower reduction
power than the others, also performs slower than the other methods.

Next, we compare the discovery effectiveness and efficiency for
the CuTS family. Given very large values of $e$ and $\delta$, the CuTS
family may produce one candidate containing all actual results after
the filter step, and then the candidate may be divided into a large
number of real convoys through the refinement step. Thus, we cannot
use the count of false positives as a measure of the filters'
effectiveness for our study.

Instead, we calculate a refinement unit that represents the
computational cost of candidates for the refinement step, which
reflects the filtering power of each method effectively. Specifically,
the clustering cost of the convoy objects in each candidate is
computed and then multiplied by the candidate's lifetime. As mentioned
earlier, the computational cost of clustering is either $O(N^2)$
without an index or $O(N \log N)$ with a spatial index. To clarify the
differences of each filter method, we considered the clustering
without index support in our experiments. For example, if a convoy
candidate has 3 objects and its lifetime is 2, the refinement unit is
$3^2 \times 2 = 18$. Next, we aggregate each candidate's unit to
obtain the total refinement unit.
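The refinement unit described above amounts to a short computation. The following Python sketch uses a hypothetical function name and represents each candidate as a pair (number of objects, lifetime), with the quadratic clustering cost assumed for the index-free case.

```python
def total_refinement_unit(candidates):
    """candidates: iterable of (num_objects, lifetime) pairs.
    Each candidate contributes num_objects^2 * lifetime."""
    return sum(n * n * lifetime for n, lifetime in candidates)

# The example from the text: 3 objects with lifetime 2 give 3^2 * 2 = 18.
example_unit = total_refinement_unit([(3, 2)])
```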

Figure 16 demonstrates the filtering power and the total discovery
times for the CuTS family when varying $\delta$. We omit the results
for the Truck and Cattle datasets, but those two datasets will be used
in the next experiments. As expected, CuTS* has the lowest refinement
unit for both datasets, which yields the highest efficiency as
well. In addition, CuTS+ has a better filtering effectiveness than
does CuTS. As discussed in Section 6.1, the actual tolerances
obtained by DP+ of CuTS+ are always smaller than or equal to those
obtained by DP of CuTS. As a result, the search range for clustering
is reduced, and the filtering power grows in the figure.
Figure 16: Effect of Simplification Tolerance ($\delta$)

Another observation found in Figure 16 for all members of the
CuTS family is that both the filters' effectiveness and the discovery
efficiency decrease as the tolerance value increases, because the
$\delta$ value affects not only the result of trajectory
simplification, but also that of range search for clustering.

Although the total elapsed times of the Car data grow steadily
with increasing $\delta$, those of the Taxi data stay almost constant
or increase only very slightly. This is because the enlargement of the
search range is not sufficient to find more actual convoys with
respect to the given parameters. From this point, we can infer that
the trajectories of the Taxi dataset are distributed relatively
uniformly, and thus the number of taxis traveling together within a
given (reasonable) distance is low.

Lastly, we study how the size of the time partition $\lambda$ affects
the results of the convoy discovery. In fact, a large value of
$\lambda$ yields an ineffective filtering step, whereas clustering is
performed more times with a small value of $\lambda$ (see
Section 5.3). For the Truck dataset in Figure 17, CuTS* shows better
performance than the other methods regardless of the $\lambda$ value.
Also, both the effectiveness of the filters and the efficiency of the
discovery process decrease when $\lambda > 10$ for this dataset, for
all methods of the CuTS family.

On the other hand, the discovery efficiency of the CuTS family
declines over the Cattle dataset when $\lambda < 30$, although their
refinement unit increases steadily in the same range of $\lambda$.
This implies that an appropriate $\lambda$ value is influenced not
only by the filter's effectiveness, but also by another factor,
possibly the length of trajectories, since the average size of
Cattle's trajectories is very large.

Another interesting observation found in the Cattle dataset is that
CuTS+ has similar efficiency to CuTS*, and it is even faster for
$\lambda > 50$. As seen in Figure 13, trajectory simplification is
the key part of the total discovery time on this dataset. Therefore,
faster trajectory simplification (i.e., DP+) plays a more important
role in the discovery efficiency in this case.

7.4 Parameter Determination of CuTS 

Proper values of $\delta$ and $\lambda$ may be difficult to find in
some applications since they are dependent on the data
characteristics. In this section, we provide guidelines for
determining settings for these parameters. Note that the parameters
do not affect the correctness of discovery results, but only affect
execution times.

Figure 17: Effect of Time Partitioning ($\lambda$)

Tolerance for trajectory simplification ($\delta$): It is obvious
that a larger value of $\delta$ for DP of CuTS achieves a higher
reduction result of trajectory simplification. On the other hand, a
large $\delta$ value is also used for the range search of clustering
in the CuTS algorithm; hence, the filter step of CuTS may not be tight
enough to prune many unnecessary candidate objects. In this tradeoff,
our goal is to find a value satisfying the following conditions:
(i) the original trajectories become well simplified, and (ii) the
distance bounds are sufficiently tight, implying an effective filter
process.

As the first step, we perform the original DP algorithm over a
trajectory with $\delta = 0$. In each step of the division process (see
details in Sections 2.2 and 5.1), we store the actual tolerance values
in ascending order. Since $\delta = 0$, the process continues until all
intermediate points of the original trajectory are tested.

In the next step, we find the largest difference between two adjacent
stored tolerances, and then select the smaller of those
two tolerances. For example, assume that the DP method with
$\delta = 0$ results in the 10 actual tolerance values $\delta_1, \delta_2, \ldots, \delta_{10}$ in
Figure 18(a) through the first step. The difference in the tolerance
values is the largest between $\delta_5$ and $\delta_6$. We then select $\delta_5$ as a
tolerance value $\delta_i$. This selection is performed as long as $\delta_i < e$ (the
dark gray bars in Figure 18(a)). From our experimental studies,
we found that the filtering power of the CuTS family decreases
considerably on some datasets when we pick $\delta_i > e$.

Lastly, we perform the above steps a sufficient number of times (e.g.,
10% of N) and average the selected tolerance values to obtain a final $\delta$ for
trajectory simplification.
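
The selection procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dp_split_tolerances`, `select_tolerance`, `estimate_delta`) are hypothetical, a trajectory is assumed to be a list of 2D `(x, y)` points, and the deviation of a point is measured to the segment joining the two end points of the current division. The sketch keeps splitting even when the deviation is zero, so that every intermediate point is tested.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment ab (2D tuples)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection parameter so the foot point stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def dp_split_tolerances(points):
    """Divide as DP does with delta = 0; return the split tolerances, ascending."""
    tolerances = []
    stack = [(0, len(points) - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo < 2:
            continue  # no intermediate point left in this division
        best_idx, best_dist = lo + 1, -1.0
        for i in range(lo + 1, hi):
            d = point_segment_distance(points[i], points[lo], points[hi])
            if d > best_dist:
                best_idx, best_dist = i, d
        tolerances.append(best_dist)
        stack.append((lo, best_idx))
        stack.append((best_idx, hi))
    return sorted(tolerances)


def select_tolerance(tolerances, e):
    """Smaller value of the adjacent pair with the largest gap, restricted to values < e."""
    best_gap, best = -1.0, None
    for small, large in zip(tolerances, tolerances[1:]):
        if small >= e:
            break  # the list is ascending, so no later candidate is below e
        if large - small > best_gap:
            best_gap, best = large - small, small
    return best


def estimate_delta(sample_trajectories, e):
    """Average the selected tolerances over a sample of trajectories."""
    picks = [select_tolerance(dp_split_tolerances(t), e) for t in sample_trajectories]
    picks = [p for p in picks if p is not None]
    return sum(picks) / len(picks) if picks else None
```

For ten hypothetical tolerance values, `select_tolerance([1, 2, 3, 4, 5, 10, 11, 12, 13, 14], e=8)` returns 5, the smaller value of the adjacent pair with the largest gap.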

The idea behind this method is to find a relatively small $\delta$ value
that achieves a reasonable reduction through simplification. In
Figure 18(a), if we pick $\delta_{10}$ and apply it to the trajectory simplification,
the reduction ratio will be nearly 100%. Likewise, the use of $\delta_5$
for the simplification is able to yield around a 50% reduction, although
it does not necessarily follow the same division process
as with $\delta = 0$ in the first step. If we pick $\delta_6$ instead, it may bring
(approximately) a 60% trajectory reduction, which is slightly higher
than 50%. However, the value of $\delta_6$ is much bigger than $\delta_5$, and
the effectiveness of range search can decrease dramatically.

Figure 18: Value Selection of $\delta$ and $\lambda$

Length of time partition ($\lambda$): In Section 5.3, we discussed
dividing the time domain $T$ into time partitions for discovery processing,
each of which has length $\lambda$. If a time partition $T_i$ has
a large value for $\lambda$, many line segments of a simplified trajectory
within $T_i$ form a long polyline. Thus, the distances among those
polylines become small, and many objects are likely to form a cluster
together, leading to ineffective filtering results. In contrast, a
small value of $\lambda$ involves many computationally expensive clustering
processes ($T/\lambda$ times).

Suppose that $o'_1$ and $o'_2$ in Figure 18(b) are simplified trajectories.
In the figure, one clustering with $\lambda_1$ is obviously more efficient than
two processes with $\lambda_2$ because both cases have the same minimum
distance between $o'_1$ and $o'_2$. From this example, we can infer the
value of $\lambda_1$ by computing $\frac{|o|}{|o'|} \times o.\tau$, where $|o|$ ($|o'|$) is the number
of points in the trajectory $o$ ($o'$) and $o.\tau$ is the time interval of $o$.

In practice, however, there may be some time points that one
(simplified) trajectory has but others do not, such as $p'_2$ on
$o'_3$ in Figure 18(c). Using $\lambda_1$ in this case would not preserve the
filter's good effectiveness, and we need to lower the $\lambda_1$ value.
We can roughly estimate the probability that such a case occurs by
looking at how densely a trajectory exists in the time domain $T$. Notice
that each trajectory may have a different length ($o.\tau$) and may
appear and disappear at arbitrary time points in $T$. Thus, the
density of the trajectory is obtained by $o.\tau / T$. Finally, the probability
that an object has an intermediate time point within $\lambda_1$ is
$(\lambda_1 - 2) \times o.\tau / T$. Together, we obtain $\lambda = \lambda_1 - (\lambda_1 - 2) \times o.\tau / T$,
which can be rewritten as $\lambda = o.\tau \times \left( \frac{|o|}{|o'|} \times \left(1 - \frac{o.\tau}{T}\right) + \frac{2}{T} \right)$.

So far, we have considered the computation of $\lambda$ for a single
object. To obtain an overall value of $\lambda$, we perform the above
computation for all objects and average the values. Note that all
the statistics for this $\lambda$ computation can be easily gathered when a
dataset is loaded into the system (or with one scan for disk-based
implementations).
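
The per-object computation and the final averaging can be sketched in Python as follows. The sketch is illustrative only and rests on assumptions: the function names are hypothetical, the per-object statistics are assumed to be available as `(n_original, n_simplified, lifetime)` triples, and `lambda1_from_counts` encodes one reading of the $\lambda_1$ estimate, namely the ratio of original to simplified points multiplied by $o.\tau$. The estimate of $\lambda_1$ is kept as a separate function so that a different estimate can be substituted.

```python
def lambda1_from_counts(n_original, n_simplified, lifetime):
    """Assumed reading of lambda_1: (|o| / |o'|) * o.tau."""
    return (n_original / n_simplified) * lifetime


def lambda_for_object(lambda1, lifetime, domain_length):
    """lambda = lambda_1 - (lambda_1 - 2) * o.tau / T for a single object."""
    density = lifetime / domain_length  # how densely the trajectory exists in T
    return lambda1 - (lambda1 - 2) * density


def estimate_lambda(stats, domain_length):
    """Average the per-object values; stats holds (|o|, |o'|, o.tau) triples."""
    values = [
        lambda_for_object(lambda1_from_counts(n, ns, tau), tau, domain_length)
        for n, ns, tau in stats
    ]
    return sum(values) / len(values)
```

Because only counts and lifetimes are needed, the `stats` list can be filled during the single pass in which the dataset is loaded.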

Although this method does not capture the distribution of a dataset
precisely, the value of $\lambda$ is obtained quickly and yields reasonable
efficiency for the CuTS family.

8. CONCLUSION 

Discovering convoys in trajectory data is a challenging problem, 
and existing solutions to related problems are ineffective at finding 
convoys. This study formally defines a convoy query using density- 
based notions, and it proposes four algorithms for computing the 
convoy query. Our main algorithms (CuTS, CuTS+, and CuTS*) 
use line simplification methods as the foundation for a filtering step 
that effectively reduces the amounts of data that need further pro- 
cessing. In order to ensure that the filters do not eliminate convoys, 
we bound the errors of the discovery processing over the simplified 
trajectories. Through our experimental results with real datasets,
we found that CuTS* shows the best performance. CuTS+ also
performs well when the given trajectories have a small number of
objects and long histories.

APPENDIX

A. PROOFS OF LEMMAS 
A.1 Proof of Lemma 1

Consider the example of Figure 7. To prove the lemma by contradiction,
assume that the following inequality holds:

$$D(o_q(t), o_i(t)) \le e$$

Since $l'_q$ is a line segment (with actual tolerance $\delta(l'_q)$) in the
simplified trajectory $o'_q$, there exists a location $a_q$ on $l'_q$ such that
$D(a_q, o_q(t)) \le \delta(l'_q)$. Similarly, there exists a location $a_i$ on $l'_i$
such that $D(a_i, o_i(t)) \le \delta(l'_i)$. Due to the triangle inequality,

$$D(a_q, a_i) \le D(a_q, o_q(t)) + D(o_q(t), o_i(t)) + D(o_i(t), a_i)$$

Combining the inequalities, we obtain:

$$D(a_q, a_i) \le \delta(l'_q) + e + \delta(l'_i) \qquad (1)$$

On the other hand, $a_q$ ($a_i$) is a location on line segment $l'_q$ ($l'_i$).
Hence, inequality (2) holds:

$$D_{LL}(l'_q, l'_i) \le D(a_q, a_i) \qquad (2)$$

From the last two inequalities (1) and (2), we get:

$$D_{LL}(l'_q, l'_i) \le e + \delta(l'_q) + \delta(l'_i) \qquad (3)$$

Therefore, the resulting contradiction proves Lemma 1.
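
The bound in inequality (3) is what makes pruning safe: if the distance between two simplified segments exceeds $e + \delta(l'_q) + \delta(l'_i)$, the original locations cannot be within $e$ of each other. A minimal Python sketch of such a check follows. It assumes 2D segments given as pairs of `(x, y)` points and takes $D_{LL}$ to be the shortest Euclidean distance between the two segments; the function names are hypothetical and not taken from the paper.

```python
import math


def _point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment ab."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def _orient(a, b, c):
    """Signed area test: > 0 if c lies to the left of the directed line ab."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segment_distance(a, b, c, d):
    """Shortest distance between segments ab and cd (assumed meaning of D_LL)."""
    # Properly crossing segments have distance zero.
    if _orient(c, d, a) * _orient(c, d, b) < 0 and _orient(a, b, c) * _orient(a, b, d) < 0:
        return 0.0
    # Otherwise the minimum is attained at an end point of one segment.
    return min(
        _point_segment_distance(a, c, d),
        _point_segment_distance(b, c, d),
        _point_segment_distance(c, a, b),
        _point_segment_distance(d, a, b),
    )


def can_prune(seg_q, tol_q, seg_i, tol_i, e):
    """True if the segment pair cannot contain original locations within e."""
    return segment_distance(*seg_q, *seg_i) > e + tol_q + tol_i
```

A pair for which `can_prune` returns `False` is only a candidate; it still has to be checked against the original trajectories in the refinement step.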

A.2 Proof of Lemma 2

Note that for all $l'_i \in S$, we have $\delta_{max}(S) \ge \delta(l'_i)$ and
$D_{min}(B(l'_q), B(S)) \le D_{LL}(l'_q, l'_i)$. If the following inequality is
satisfied:

$$D_{min}(B(l'_q), B(S)) > e + \delta(l'_q) + \delta_{max}(S)$$

then the next inequality must also hold:

$$D_{LL}(l'_q, l'_i) > e + \delta(l'_q) + \delta(l'_i)$$

The rest of this proof follows directly from Lemma 1.

A.3 Proof of Lemma 3

Since $l'_q$ is a line segment (with actual tolerance $\delta(l'_q)$) in the
simplified trajectory $o'_q$, the location $l'_q(t)$ meets:

$$D(l'_q(t), o_q(t)) \le \delta(l'_q)$$

Similarly, the location $l'_i(t)$ satisfies:

$$D(l'_i(t), o_i(t)) \le \delta(l'_i)$$

In addition, we have:

$$D^*(l'_q, l'_i) \le D(l'_q(t), l'_i(t))$$

The logic of the remainder of the proof is the same as in the proof
of Lemma 1.



B. ADDITIONAL EXPERIMENTS 
B.1 MC vs. CMC

In this experiment, we intend to demonstrate empirically that
methods for the discovery of moving clusters cannot be used to
compute convoys directly (see Section 2.1). Specifically, we study
the accuracy of convoy discovery by a solution for moving clusters
(MC2). MC2 reports results of the convoy query if the portion
of common objects in any two consecutive clusters $c_1$ and $c_2$ is not
below a given threshold parameter $\theta$, i.e., $\frac{|c_1 \cap c_2|}{|c_1 \cup c_2|} \ge \theta$.

Let $R_m$ be a result set of convoys discovered by MC2 and $R_c$ be
another set obtained by CMC (or CuTS). We measure the proportions
of false positives in Figure 19(a) by verifying whether each
convoy $v \in R_m$ satisfies the query condition with respect to $m$, $k$,
and $e$ using the results of CMC (i.e., $\frac{|R_m - R_c|}{|R_m|} \times 100$). Likewise,
false negatives in Figure 19(b) are computed by $\frac{|R_c - R_m|}{|R_c|} \times 100$.

Figure 19: Discovery Quality of the MC method for Convoys

In fact, MC2 reported more convoys than CMC did
because MC2 does not have the lifetime constraint $k$.
This feature was especially obvious for the Cattle dataset, which is
larger than the others. As a result, the proportions of actual convoys
in the result set were very low, and the numbers of false positives
were very high in Figure 19(a). For the other datasets, false
positives went up as the $\theta$ value grew since the number of convoys
reported by MC2 also increased. Let $\theta_{c_1 c_2}$ be the ratio of common
objects between two snapshot clusters $c_1$ and $c_2$. Assume that
there are four consecutive snapshot clusters $c_1$, $c_2$, $c_3$, and $c_4$, and
$\theta_{c_1 c_2} = 1.0$, $\theta_{c_2 c_3} = 0.8$, $\theta_{c_3 c_4} = 1.0$. If we set the value of $\theta$ to
be equal to or smaller than 0.8, one moving cluster having all the
snapshot clusters will be reported (say $MC_{c_1 c_2 c_3 c_4}$). In contrast,
when $\theta > 0.8$, MC2 will discover two moving clusters $MC_{c_1 c_2}$
and $MC_{c_3 c_4}$. Therefore, a higher $\theta$ value may produce a larger
number of moving clusters as convoy results.
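
The four-cluster example can be reproduced with a short Python sketch. It is a deliberate simplification and not the MC2 algorithm itself: it assumes a single non-empty snapshot cluster per time point, represented as a set of object identifiers, and the function names are hypothetical.

```python
def overlap_ratio(c1, c2):
    """Ratio of common objects between two non-empty snapshot clusters."""
    return len(c1 & c2) / len(c1 | c2)


def chain_moving_clusters(snapshot_clusters, theta):
    """Group consecutive snapshot clusters whose overlap ratio is at least theta."""
    if not snapshot_clusters:
        return []
    chains = [[snapshot_clusters[0]]]
    for prev, cur in zip(snapshot_clusters, snapshot_clusters[1:]):
        if overlap_ratio(prev, cur) >= theta:
            chains[-1].append(cur)   # the moving cluster continues
        else:
            chains.append([cur])     # the chain breaks; start a new one
    return chains


# Overlap ratios between consecutive clusters: 1.0, 0.8, 1.0
clusters = [{1, 2, 3, 4}, {1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 5}]
one_chain = chain_moving_clusters(clusters, theta=0.8)
two_chains = chain_moving_clusters(clusters, theta=0.9)
```

With `theta=0.8` all four snapshot clusters end up in one chain, whereas `theta=0.9` splits them into two chains of two clusters each.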

Even though MC2 returns many convoys, the result set did not
necessarily contain all actual convoys. We investigate this aspect by
computing false negatives in Figure 19(b). In general, the number
of false negatives increases as the $\theta$ value increases because the
number of convoys discovered by MC2 also increases. Note that
if many actual convoys exist for different parameter settings, the
proportions of both false positives and false negatives may increase
considerably. Therefore, the use of moving cluster methods for
convoy discovery is ineffective and unreliable.
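
The two quality measures used in this experiment can be sketched in Python as follows. The sketch assumes that each convoy is a hashable value, here a hypothetical `(objects, start, end)` tuple, and that a convoy reported by MC2 counts as correct exactly when it also appears in the CMC result; the function names are not from the paper.

```python
def false_positive_pct(r_m, r_c):
    """Percentage of convoys reported by MC2 (r_m) that are not in the CMC result (r_c)."""
    return 100.0 * len(r_m - r_c) / len(r_m) if r_m else 0.0


def false_negative_pct(r_m, r_c):
    """Percentage of convoys in the CMC result (r_c) that MC2 (r_m) failed to report."""
    return 100.0 * len(r_c - r_m) / len(r_c) if r_c else 0.0


# Hypothetical result sets: (member objects, start time, end time)
r_m = {
    (frozenset({1, 2, 3}), 0, 9),
    (frozenset({4, 5}), 2, 4),
    (frozenset({6, 7}), 1, 3),
    (frozenset({8, 9}), 5, 6),
}
r_c = {
    (frozenset({1, 2, 3}), 0, 9),
    (frozenset({10, 11, 12}), 3, 12),
}
fp = false_positive_pct(r_m, r_c)
fn = false_negative_pct(r_m, r_c)
```

In this toy input three of the four MC2 results are absent from the CMC result, and one of the two actual convoys is missed.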



